Homework 3: Fake News Classification
This really says a lot about our society.
Fake News Classification
Ah yes, here’s something that everyone who’s learning machine learning has done: trying to classify fake news based on its contents. It’s a very natural thing to try since you hear about it all the time, and maybe it’s something simple enough that machines can pick out with just amateur-level resources and skills.
Even putting aside the importance to society, a large part of the reason this is such a popular machine learning exercise is how simple it is to frame as a machine learning task. Our predictors (what the machine will be privy to) are of course exactly the information you’d be seeing if you were scrolling through your news articles: the headlines and the body text. And of course, the objective will be 0/1 real/fake labels.
Speaking of which, I should note: the original dataset came from Kaggle, where it had already been labeled (using information from a paper on this topic), but it has been reconfigured and cleaned for us. Luckily, that better version is directly available on GitHub, so you can follow along with this blog post directly if you’d like.
Either way, one thing we’ll be exploring in this tutorial, beyond just how to build and run the model, is a very common question around any machine learning task: exactly how much information should the machine have access to? It might seem natural that the more information the better, but of course, that is the trap of overfitting. (There’s a nice, concise Stack Exchange discussion on this overfitting topic, and our professor discussed it pretty well as it relates to machine learning in Python specifically.)
Either way though, first things first - packages. Make sure you have at least TensorFlow 2.4 installed; otherwise you won’t be able to use some of the text vectorization tools we’re going to need.
pip install tensorflow==2.4
We’re going to start off then like any good machine learning tutorial should: importing tensorflow + friends numpy and pandas.
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow import keras
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
# for embedding viz
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
Then we can actually import the dataset to see what we’re dealing with.
train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)[["title","text","fake"]] # only need these columns
df.head() # the other column is just some index
| | title | text | fake |
|---|---|---|---|
| 0 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |
The data is already pretty nice and clean, but there’s one more step we should take at the outset: removing stop words. These are words like is, the, and at which are really common grammatically but don’t usually contribute to the overall meaning of text.
Of course, we’ll be relying on a list of standard stop words someone else has made, in particular, one from nltk (Natural Language Toolkit). In case you don’t already have this, you’ll have to download their list of stopwords.
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
stop = stopwords.words('english')
And this really is just a Python list of words we can remove.
stop[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
We know how to work with lists, so removing everything in here from each entry in df["text"] is as easy as using an apply.
clean_text = df["text"].apply(lambda x: " ".join([item for item in x.split() if item not in stop])).to_frame()
(We need that to_frame at the end there to convert the Series (column) resulting from the apply into a fully fledged DataFrame, because this is what TensorFlow wants.)
Essentially, this breaks apart each entire text body into a list of words, throws out ones that are stop words, and stitches it back together.
clean_text.iloc[0].text
'German Chancellor Angela Merkel said Monday strong showing Austria anti-immigrant Freedom Party (FPO) Sunday election big challenge parties. Speaking news conference Berlin, Merkel added hoping close cooperation Austria conservative election winner Sebastian Kurz European level.'
So as you can see, the text is a little less human-readable now, but all the important words are there, and that’s all the computer needs.
Create a (Tensorflow) Dataset
As another step in our data preparation, we’re going to create a TensorFlow Dataset. These are really nice in that they “abstract” away (and you’ll hear that term more the more you work around specialized computer software like this) the nitty-gritty of having many different elements in a machine learning pipeline, and simply allow us to keep lots of differently shaped data all in one place that’s easy to reference.
We won’t be using the full power of what these Datasets can do, but the nicest way they help us here is in answering that question we posed at the beginning, where we were interested in which features to include. That naturally means we’re going to want a separate input for the headline titles, and one for the body text. (And then of course, the fake labels as our answer key.)
data = tf.data.Dataset.from_tensor_slices(
(
{
"title" : df[["title"]],
"text" : clean_text
},
{
"fake" : df[["fake"]]
}
)
)
And just because this will be useful when we bring in a separate testing dataset, we should consolidate this work of getting from csv to tf.data.Dataset into a function.
def to_dataset(df):
    # apply the same stop-word cleaning we did above to the text column
    df = df.copy()
    df["text"] = df["text"].apply(
        lambda x: " ".join([item for item in x.split() if item not in stop]))
    data = tf.data.Dataset.from_tensor_slices(
        ({
            "title" : df[["title"]],
            "text" : df[["text"]]
        },
        {
            "fake" : df[["fake"]]
        }))
    # batch the dataset to increase the speed of training
    data = data.batch(100)
    return data
I didn’t make this function out of laziness; rather, I realized that we’d actually have to import a second dataset later for testing, so this would help quite a bit. It’s also just cleaner. Within this function, I also learned about using
data = data.batch(100)
which will help with speed because we’ll batch together inputs when training.
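To see what batching actually does, here’s a tiny sketch on a toy dataset of integers rather than our news data (the idea is the same: each training step then processes a whole batch at once instead of one element at a time):

```python
import tensorflow as tf

# toy dataset of 10 integers, grouped into batches of 4
ds = tf.data.Dataset.range(10).batch(4)
batches = [list(b.numpy()) for b in ds]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note the last batch just holds whatever is left over, which is why it’s smaller.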
Test-Train Split
Again, as is tradition in machine learning, we’ll split our entire Dataset into training and validation subsets to more easily control what our model has access to, as well as to have unseen data against which to evaluate. (A fully held-out testing set will come in from a separate file later.)
This is mainly for the purposes of avoiding overfitting again, but a much more detailed discussion on why this splitting helps can once again be found in Professor Chodrow’s lecture notes.
But the crux is that we’re going to pick a random 80% of the data for training, and the rest will be testing data.
train_size = int(0.8*len(data))
The Dataset class has really nice built-in functionality for this. To make sure we take random chunks, we can first shuffle the data around using
data = data.shuffle(buffer_size = len(data))
Then finally, we take the first 80% as our training set, and to get the remaining 20%, we want to skip over the first 80% reserved for training. This is easily accomplished with the skip method, which does exactly that.
train = data.take(train_size)
val = data.skip(train_size)
len(train), len(val)
(17959, 4490)
Neural Network Layers
The main way in which we’ll train our models is with neural networks. These are once again a much bigger topic than I’ll go into here, but allow me to go on a little rant. Most of the people trying to advertise all this machine learning stuff to new learners are always saying things like “oh yes, neural networks/regression/random forests are actually really simple”. But you and I know that’s a blatant lie - do you really think people would devote their lives to machine learning research if regression really were just “fitting a curve”? Most of the time, people are just hiding lots of the scarier details to entice newcomers, and in my opinion this is why so many people get turned off of machine learning nowadays: their expectations for working with TensorFlow and the like get built up really high, when there really isn’t a button that just says “do regression on my data, please”.
All this to say: I’m acknowledging I’m not going to do all the machine learning explanations here much justice, but I’ll present the headlines for each topic.
So as far as this tutorial is concerned,
A neural network consists of a bunch of “layers” (calculations with weights) the data is pushed through.
and we’ll let TensorFlow take care of the rest.
The “machine learning” just comes in when we readjust those association weights to minimize the difference between our predictions (what we have at the last layer) and the true labels we know externally.
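That readjustment of weights is gradient descent under the hood. As a conceptual sketch (plain Python, one weight, nothing like a real network), minimizing the squared difference between a prediction `w * x` and a true label looks like:

```python
# toy illustration: learn w so that w * x matches y_true, by gradient descent
w = 0.0
x, y_true = 2.0, 6.0      # the "right" answer here is w = 3
lr = 0.05                 # learning rate: how big a step we take each time

for _ in range(200):
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x   # derivative of (w*x - y_true)**2 w.r.t. w
    w -= lr * grad                     # step downhill along the loss

print(round(w, 4))  # 3.0
```

TensorFlow does exactly this kind of loop for us, just with millions of weights and a fancier optimizer.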
And layers aren’t all the same either; they can have their own specialized roles to play in how the data is processed, affording us lots of flexibility in tuning modular parts of the whole model. Of course, a model is only as good as the layers we put together for it, so we’ll want to make sure we give them enough power to process our text documents well. The critical layers we’ll be using in our models are vectorization and embedding.
Text Processing
Before we actually talk about vectorization, we need to create a standardization function to make sure our text doesn’t still have a lot of meaningless noise lying around after removing stop words, which could confuse the computer. While removing stop words cleaned out meaningless bits of vocabulary and grammar, there still might be differences in punctuation and capitalization (for instance, you may write “Machine Learning” while I say “machine learning”, but we still want to count those the same).
Let’s make a nice function that does this for us.
import re
import string
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                        '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation
For example,
standardization("Hoppin' out the Wraith, esskeetit").numpy()
b'hoppin out the wraith esskeetit'
Now we’re ready for the actual vectorization. As I mentioned in the previous blog post, we like to put our data into formats computers/math like to work with, like vectors or matrices. One way of doing this is by ranking every single important word by how frequently it appears in our data. For example, one “datapoint” of text:
Aliens Sighted ~in~ LA
could be transformed into
[1900 900 500]
where “aliens” is the 1900th most common word, “sighted” is the 900th, and “LA” is the 500th (“in” is probably a stop word).
And if you think about it, this is actually a really nice way to condense information about what a word means to humans into a really nice format for computers. For example, if “aliens” is the 1900th word by frequency out of 2000, and aliens are hardly mentioned in news we’re pretty sure is real, and then all of a sudden “aliens” is the 10th most common word in another article, that document would be rather sus.
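Stripped of all the TensorFlow machinery, the frequency ranking itself is a simple idea; here’s a toy sketch with a made-up two-document corpus (TextVectorization handles this for us internally, plus padding and out-of-vocabulary words):

```python
from collections import Counter

docs = ["aliens sighted in la", "la aliens return"]   # hypothetical tiny corpus
counts = Counter(word for doc in docs for word in doc.split())

# most_common() sorts by frequency, so position = frequency rank
rank = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
vector = [rank[w] for w in "aliens sighted la".split()]
print(vector)  # [1, 3, 2]
```

So each document becomes a list of integer ranks - exactly the kind of numeric input a neural network can chew on.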
It’s nice to know about how the data is being transformed, but all the technical details will be taken care of by that TensorFlow TextVectorization layer we imported earlier. You can read its documentation, but these are the arguments we need to provide:
# only the top distinct words will be tracked
max_tokens = 2000
# each headline will be a vector of length 25
sequence_length = 25
vectorize_layer = TextVectorization(
standardize=standardization,
max_tokens=max_tokens, # only consider this many words
output_mode='int',
output_sequence_length=sequence_length)
The main highlights are that we need to provide a max_tokens, which indicates how many words we’ll have in our “vocabulary” (really rare words like Thyssenkrupp probably won’t help our model learn too much), the standardize parameter that takes in the standardization function we built earlier, and the output mode of int to indicate we wanted those numeric rankings.
And then getting the vectorization layer to calculate all those overall rankings for our data is as simple as using its adapt function on our input text.
headlines = train.map(lambda x, y: x["title"])
vectorize_layer.adapt(headlines)
Note: remember what I was saying earlier about how layers in neural networks can have their own special purpose? This adapt step is exactly that - the vectorization layer’s special job is to learn the word rankings from our data before any actual model fitting happens.
Finally, we’ll create keras.Input objects to distinguish between places where we input title (headline) data, and text (body) data. Even though they’re both string data, it’s nice that we have two different “categorizations” of the data, because these are fundamentally different objects, where words that appear in one place might not mean the same as the same words appearing in the other place.
title_input = keras.Input(
shape = (1,),
name = "title",
dtype = "string"
)
text_input = keras.Input(
shape = (1,),
name = "text",
dtype = "string"
)
Modeling
Now we can finally start putting together the models, which will once again be constructed by stacking many layers - instantiating lots of different layers and repeatedly calling them on our input. We’ll make this into a function, because we’ll be using the same overall layer architecture for all the model experiments we’ll end up creating.
Note that we can actually name layers for easy reference later. In fact, it’s important that we name the last output layer the same as the name of our labels fake so that TensorFlow knows that those really are the final outputs.
Only Headlines
def create_layers(input_data, embedding_layer=None):
    # if no shared embedding layer is passed in, this is a standalone model,
    # so make a fresh embedding and name the final layer "fake"
    standalone = embedding_layer is None
    if standalone:
        embedding_layer = layers.Embedding(max_tokens, 20, name = "embedding")
    features = vectorize_layer(input_data)
    features = embedding_layer(features)
    features = layers.Dropout(0.2)(features)
    features = layers.GlobalAveragePooling1D()(features)
    features = layers.Dropout(0.2)(features)
    features = layers.Dense(32, activation='relu')(features)
    output = layers.Dense(2, name = "fake" if standalone else None)(features)
    return output
I recommended to some people that they make a function like this to create the layers, because otherwise there’s a lot of repeated code across the three different models we’ll be creating.
A couple numbers to note here: the Embedding layer has 20 dimensions, and the final Dense layer has 2, the number of “classifications” for each article (fake or real).
In case you’re wondering what those layers Dropout and Dense mean, those are less important than the ones I highlighted already, but the highlights are that
- Dense is a kind of layer they call “densely connected”, where every input node is connected to every output node, or in other words, every input affects every part of the next layer.
- Dropout is a technique where randomly selected neurons are “dropped” during training. This means their contribution to the activation of later layers is temporarily removed on the forward pass, and weight updates are not applied to those neurons. This might sound like a bad thing, but the overall reward is that the network becomes less sensitive to the specific weights of individual nodes, i.e., less prone to overfitting the training data.
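The Dropout behavior can be sketched conceptually in a few lines of numpy (this is an illustration of the idea, not the actual Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_dropout(x, rate, training=True):
    if not training:
        return x                         # at inference time, dropout does nothing
    mask = rng.random(x.shape) >= rate   # keep each unit with probability 1 - rate
    return x * mask / (1.0 - rate)       # rescale survivors to preserve the mean

x = np.ones(8)
y = toy_dropout(x, rate=0.25)
# each entry of y is either 0.0 (dropped) or about 1.33 (kept and rescaled up)
```

The rescaling is why we don’t need to do anything special at prediction time: on average, the layer’s output is the same whether dropout is on or off.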
Especially when we have so many layers stacked together, I think this can be one of the most intimidating parts about getting into machine learning: how the heck did we come up with these layers, and what do they mean? It’s easy to see what things like print or even ' '.join do just by looking up documentation, but documentation for these layers can be … less than helpful for a beginner.
So I recommended some ways for peers to explain this layers part, because many times it really was just copied from lecture notes, and while that’s fine, it can still be intimidating.
Either way, now that we have the output as its own variable after stacking several layers before it, we can feed it to a keras class that will scaffold a fully fledged model around our specified layers.
only_title_model = keras.Model(
inputs = [title_input],
outputs = create_layers(title_input)
)
Then we have to “compile” the model - turn all the layers into a single “machine” that TensorFlow can feed data through and tune to help it learn. The important part here is that we also specify an optimizer and a loss function to perform the learning. (Those concepts are again explained in the PIC16A lecture notes.)
only_title_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
The highlight of this is that machine learning = function optimization, and we’re just specifying how to optimize, and how to see how far or close we’re moving from the optimal value.
Then finally, we can perform the “learning”, or fitting. We feed in the training data:
history = only_title_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
D:\anaconda3\envs\PIC16B\lib\site-packages\tensorflow\python\keras\engine\functional.py:595: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
[n for n in tensors.keys() if n not in ref_input_names])
17959/17959 [==============================] - 21s 979us/step - loss: 0.2472 - accuracy: 0.8996 - val_loss: 0.1037 - val_accuracy: 0.9621
Epoch 2/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1140 - accuracy: 0.9556 - val_loss: 0.0875 - val_accuracy: 0.9686
Epoch 3/20
17959/17959 [==============================] - 17s 923us/step - loss: 0.1052 - accuracy: 0.9595 - val_loss: 0.0890 - val_accuracy: 0.9686
Epoch 4/20
17959/17959 [==============================] - 16s 915us/step - loss: 0.0947 - accuracy: 0.9648 - val_loss: 0.0792 - val_accuracy: 0.9690
Epoch 5/20
17959/17959 [==============================] - 17s 958us/step - loss: 0.0939 - accuracy: 0.9663 - val_loss: 0.0813 - val_accuracy: 0.9679
Epoch 6/20
17959/17959 [==============================] - 17s 926us/step - loss: 0.0881 - accuracy: 0.9677 - val_loss: 0.0631 - val_accuracy: 0.9746
Epoch 7/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0814 - accuracy: 0.9698 - val_loss: 0.0669 - val_accuracy: 0.9739
Epoch 8/20
17959/17959 [==============================] - 21s 1ms/step - loss: 0.0854 - accuracy: 0.9693 - val_loss: 0.0738 - val_accuracy: 0.9746
Epoch 9/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0824 - accuracy: 0.9681 - val_loss: 0.0595 - val_accuracy: 0.9757
Epoch 10/20
17959/17959 [==============================] - 23s 1ms/step - loss: 0.0776 - accuracy: 0.9705 - val_loss: 0.0637 - val_accuracy: 0.9771
Epoch 11/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0769 - accuracy: 0.9705 - val_loss: 0.0671 - val_accuracy: 0.9742
Epoch 12/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0745 - accuracy: 0.9714 - val_loss: 0.0578 - val_accuracy: 0.9793
Epoch 13/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.0745 - accuracy: 0.9741 - val_loss: 0.0542 - val_accuracy: 0.9815
Epoch 14/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0739 - accuracy: 0.9733 - val_loss: 0.0597 - val_accuracy: 0.9791
Epoch 15/20
17959/17959 [==============================] - 18s 988us/step - loss: 0.0701 - accuracy: 0.9734 - val_loss: 0.0557 - val_accuracy: 0.9804
Epoch 16/20
17959/17959 [==============================] - 17s 930us/step - loss: 0.0781 - accuracy: 0.9723 - val_loss: 0.0602 - val_accuracy: 0.9768
Epoch 17/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0697 - accuracy: 0.9748 - val_loss: 0.0573 - val_accuracy: 0.9784
Epoch 18/20
17959/17959 [==============================] - 18s 973us/step - loss: 0.0695 - accuracy: 0.9739 - val_loss: 0.0551 - val_accuracy: 0.9786
Epoch 19/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0741 - accuracy: 0.9726 - val_loss: 0.0475 - val_accuracy: 0.9817
Epoch 20/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.0660 - accuracy: 0.9767 - val_loss: 0.0596 - val_accuracy: 0.9784
from matplotlib import pyplot as plt
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1db69d22948>

Predictions on Unseen Data
Let’s bring in some more unseen data for our algorithm to take a stab at. By unseen, I really do mean that the computer really never had access to this information to learn. I could have brought it in at the beginning, but just to emphasize, we never needed it till this point, so why not just bring it in now?
test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"
test = to_dataset(pd.read_csv(test_url))
With all the abstraction, we can now just ask TensorFlow to apply what our model learned to the unseen testing data with a single “button press”.
only_title_model.evaluate(test)
D:\anaconda3\envs\PIC16B\lib\site-packages\tensorflow\python\keras\engine\functional.py:595: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
[n for n in tensors.keys() if n not in ref_input_names])
225/225 [==============================] - 1s 2ms/step - loss: 0.1090 - accuracy: 0.9600
[0.1089763417840004, 0.9599536657333374]
Nice, basically by just stacking up the right layers, we got some pretty great accuracy, around 96%.
Only Text
Now, we’ll perform the exact same process, this time with the text only. Thanks to our work earlier creating that function to build the layers for us, this just means passing text_input to that function.
only_text_model = keras.Model(
inputs = [text_input],
outputs = create_layers(text_input)
)
only_text_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
history = only_text_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
D:\anaconda3\envs\PIC16B\lib\site-packages\tensorflow\python\keras\engine\functional.py:595: UserWarning: Input dict contained keys ['title'] which did not match any model input. They will be ignored by the model.
[n for n in tensors.keys() if n not in ref_input_names])
17959/17959 [==============================] - 21s 1ms/step - loss: 0.2548 - accuracy: 0.8969 - val_loss: 0.1118 - val_accuracy: 0.9597
Epoch 2/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.1248 - accuracy: 0.9537 - val_loss: 0.1207 - val_accuracy: 0.9570
Epoch 3/20
17959/17959 [==============================] - 21s 1ms/step - loss: 0.1120 - accuracy: 0.9580 - val_loss: 0.0993 - val_accuracy: 0.9628
Epoch 4/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.1200 - accuracy: 0.9574 - val_loss: 0.1046 - val_accuracy: 0.9572
Epoch 5/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.1169 - accuracy: 0.9576 - val_loss: 0.0911 - val_accuracy: 0.9682
Epoch 6/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1055 - accuracy: 0.9618 - val_loss: 0.0962 - val_accuracy: 0.9650
Epoch 7/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1002 - accuracy: 0.9612 - val_loss: 0.0918 - val_accuracy: 0.9670
Epoch 8/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.1093 - accuracy: 0.9578 - val_loss: 0.0991 - val_accuracy: 0.9628
Epoch 9/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.1031 - accuracy: 0.9617 - val_loss: 0.0950 - val_accuracy: 0.9675
Epoch 10/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1067 - accuracy: 0.9624 - val_loss: 0.0921 - val_accuracy: 0.9659
Epoch 11/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0990 - accuracy: 0.9647 - val_loss: 0.0936 - val_accuracy: 0.9619
Epoch 12/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.1044 - accuracy: 0.9628 - val_loss: 0.0819 - val_accuracy: 0.9690
Epoch 13/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0999 - accuracy: 0.9636 - val_loss: 0.0862 - val_accuracy: 0.9661
Epoch 14/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.0992 - accuracy: 0.9659 - val_loss: 0.0935 - val_accuracy: 0.9675
Epoch 15/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1014 - accuracy: 0.9608 - val_loss: 0.0873 - val_accuracy: 0.9686
Epoch 16/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.1016 - accuracy: 0.9642 - val_loss: 0.0880 - val_accuracy: 0.9646
Epoch 17/20
17959/17959 [==============================] - 18s 1ms/step - loss: 0.0959 - accuracy: 0.9628 - val_loss: 0.0904 - val_accuracy: 0.9679
Epoch 18/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0972 - accuracy: 0.9656 - val_loss: 0.0944 - val_accuracy: 0.9641
Epoch 19/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0962 - accuracy: 0.9627 - val_loss: 0.0987 - val_accuracy: 0.9619
Epoch 20/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0930 - accuracy: 0.9649 - val_loss: 0.0941 - val_accuracy: 0.9695
from matplotlib import pyplot as plt
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1db6a823c88>

only_text_model.evaluate(test)
225/225 [==============================] - 3s 14ms/step - loss: 2.1707 - accuracy: 0.6363
[2.1707146167755127, 0.6362866759300232]
And this might come as a bit of a surprise, at least it did for me. That’s a big drop in accuracy! Shouldn’t the text of the article contain even more information than the title, so shouldn’t this be performing even better? At the end of the day, with the information I see right now, it seems a bit hard to point any fingers specifically with good evidence, but I’m suspecting fake news titles are actually more exaggerated or embellished in order to catch more eyes, and the machine learning algorithm might have had an easier time picking up on that.
Both Text and Title
Now, we’ll perform the exact same process, this time with both title and text. Aren’t you glad we made that function to create layers for us?
Except this time, we do have to be a bit more careful here. The first reason is that we’re going to want to use a shared embedding layer, where we use the same text embedding for both the text input and the title input. (We’ve still yet to talk about what the embedding actually is, but for now just think of it as how the computer is converting strings into something it can understand).
shared_embedding = layers.Embedding(max_tokens, 10, name = "embedding")
title_features = create_layers(title_input, embedding_layer=shared_embedding)
text_features = create_layers(text_input, embedding_layer=shared_embedding)
The second reason is that the structure of our model has to change a bit now that we have two sets of layers running in parallel with two separate outputs, which we’ll eventually need to concatenate.
main = layers.concatenate([title_features, text_features], axis = 1)
main = layers.Dense(32, activation='relu')(main)
output = layers.Dense(2, name = "fake")(main)
And after passing that concatenation through some more dense layers to reshape the output into the number of categories we want (one unit for fake and one for real), we have our output layer.
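Shape-wise, that concatenation just glues the two feature vectors together side by side; here’s a quick numpy analogue with toy numbers and a batch of one:

```python
import numpy as np

title_features = np.array([[0.2, 0.8]])   # pretend (batch, 2) output from the title branch
text_features  = np.array([[0.6, 0.4]])   # pretend (batch, 2) output from the text branch

combined = np.concatenate([title_features, text_features], axis=1)
print(combined.shape)  # (1, 4)
```

The later Dense layers then mix those four numbers together, so information from both branches can influence the final fake/real decision.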
title_and_text_model = keras.Model(
inputs = [title_input, text_input],
outputs = output
)
By the way, now that we’ve made the model, I’d like to show you a neat way of visualizing all the layers to see where the inputs and outputs are going, especially for this case where we have once again those two separate and parallel routes.
keras.utils.plot_model(title_and_text_model)

I didn’t know about this plot_model function, but I found it after reading another person’s blog post and realized it was really cool. It’s especially nice to display here, where I’m trying to help people see what a neural network is doing in this more unusual case with more than one input.
title_and_text_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
history = title_and_text_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
17959/17959 [==============================] - 22s 1ms/step - loss: 0.1882 - accuracy: 0.9111 - val_loss: 0.0418 - val_accuracy: 0.9869
Epoch 2/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0572 - accuracy: 0.9815 - val_loss: 0.0572 - val_accuracy: 0.9802
Epoch 3/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0438 - accuracy: 0.9849 - val_loss: 0.0224 - val_accuracy: 0.9933
Epoch 4/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0378 - accuracy: 0.9881 - val_loss: 0.0402 - val_accuracy: 0.9833
Epoch 5/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0363 - accuracy: 0.9874 - val_loss: 0.0294 - val_accuracy: 0.9909
Epoch 6/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0333 - accuracy: 0.9900 - val_loss: 0.0280 - val_accuracy: 0.9924
Epoch 7/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0321 - accuracy: 0.9898 - val_loss: 0.0357 - val_accuracy: 0.9853
Epoch 8/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0344 - accuracy: 0.9877 - val_loss: 0.0215 - val_accuracy: 0.9942
Epoch 9/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0316 - accuracy: 0.9902 - val_loss: 0.0179 - val_accuracy: 0.9940
Epoch 10/20
17959/17959 [==============================] - 22s 1ms/step - loss: 0.0280 - accuracy: 0.9904 - val_loss: 0.0170 - val_accuracy: 0.9947
Epoch 11/20
17959/17959 [==============================] - 22s 1ms/step - loss: 0.0282 - accuracy: 0.9914 - val_loss: 0.0156 - val_accuracy: 0.9953
Epoch 12/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0269 - accuracy: 0.9923 - val_loss: 0.0137 - val_accuracy: 0.9960
Epoch 13/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0259 - accuracy: 0.9922 - val_loss: 0.0162 - val_accuracy: 0.9951
Epoch 14/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0256 - accuracy: 0.9915 - val_loss: 0.0154 - val_accuracy: 0.9958
Epoch 15/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0251 - accuracy: 0.9909 - val_loss: 0.0123 - val_accuracy: 0.9955
Epoch 16/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0239 - accuracy: 0.9928 - val_loss: 0.0135 - val_accuracy: 0.9951
Epoch 17/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0195 - accuracy: 0.9938 - val_loss: 0.0184 - val_accuracy: 0.9940
Epoch 18/20
17959/17959 [==============================] - 19s 1ms/step - loss: 0.0228 - accuracy: 0.9939 - val_loss: 0.0105 - val_accuracy: 0.9969
Epoch 19/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0212 - accuracy: 0.9932 - val_loss: 0.0174 - val_accuracy: 0.9955
Epoch 20/20
17959/17959 [==============================] - 20s 1ms/step - loss: 0.0237 - accuracy: 0.9923 - val_loss: 0.0247 - val_accuracy: 0.9927
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1db6ae5bdc8>

title_and_text_model.evaluate(test)
225/225 [==============================] - 3s 14ms/step - loss: 0.1119 - accuracy: 0.9689
[0.1119401752948761, 0.9689072966575623]
And we do just slightly better than the “only title” model, so we can declare this one the winner.
I suppose one warning you hear quite a bit when you’re first studying machine learning is that “too much information can actually cause more harm than good”.
But in this case, the model with the most information possible has amazing accuracy on unseen data. One idea I’d throw out there as to why: the “feel” of the headline in a faker, more clickbait-y article is dramatized and intentionally made more inflammatory, and maybe the body of the article carries that same tone, so the extra text reinforces the signal rather than drowning it in noise.
Embeddings
Remember that embedding layer I said was important but we’d save talking about for later? Now is later.
Let’s just start by poking around the embedding layer at the end of model fitting. This is where that naming of layers comes in: we can just ask for the layer by name and open up the hood a bit to see the numbers inside with
weights = title_and_text_model.get_layer('embedding').get_weights()[0] # get the weights from the embedding layer
vocab = vectorize_layer.get_vocabulary() # get the vocabulary from our data prep for later
weights
array([[ 0.01516789, -0.02696777, 0.03883211, ..., 0.02343773,
0.01760186, 0.01586471],
[-0.0337828 , 0.03754019, -0.0383072 , ..., -0.0289585 ,
-0.02819344, -0.04658148],
[-0.00865335, 0.02707106, -0.10757445, ..., 0.03221235,
0.00225151, 0.01295694],
...,
[-0.41807944, 0.38946232, -0.40882814, ..., -0.5125998 ,
-0.4375395 , -0.43943298],
[-0.3980184 , 0.5073266 , -0.5625422 , ..., -0.38761407,
-0.5029228 , -0.37710074],
[-0.46357223, 0.48129812, -0.5456448 , ..., -0.48393884,
-0.3351998 , -0.4587918 ]], dtype=float32)
weights.shape
(2000, 20)
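One thing worth making explicit: each row of `weights` lines up with the same index in `vocab`, so pulling out a single word’s vector is just an index lookup. Here’s a minimal sketch of that idea, with a toy vocabulary and weight matrix standing in for the real ones:

```python
import numpy as np

# toy stand-ins for the real vocab / weights from above
vocab = ["", "[UNK]", "to", "trump", "in"]
weights = np.arange(5 * 3, dtype=float).reshape(5, 3)  # 5 words, 3 dims

def word_vector(word, vocab, weights):
    """Return the embedding row for `word` (falling back to [UNK] if out of vocabulary)."""
    idx = vocab.index(word) if word in vocab else vocab.index("[UNK]")
    return weights[idx]

print(word_vector("trump", vocab, weights))   # row 3 of the toy matrix
print(word_vector("zzzzz", vocab, weights))   # not in vocab, falls back to the [UNK] row
```

With the real objects, you’d swap in the `vocab` and `weights` we extracted above; the lookup logic is the same.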
See? I really wasn’t kidding when I said each layer just holds a bunch of (meaningful!) numbers. In fact, they’re so meaningful that we can actually gain some really nice insights into the associations between words that our model built for itself.
That’s right - we’re getting into what an embedding really is. This is the layer of weights where the neural network learned which words are closely associated with each other: imagine each word in our 2000-word vocabulary as a point in 20-dimensional space, with points that sit physically closer being more related.

Image credit: Towards Data Science
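One way to make “physically closer means more related” concrete is cosine similarity between two word vectors. This is a sketch with toy vectors (the real version would pull two rows out of `weights`):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 = same direction, near 0 = unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors: a and b point the same way, c is orthogonal to both
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
c = np.array([0.0, 0.0, 5.0])

print(cosine_similarity(a, b))  # close to 1: very similar
print(cosine_similarity(a, c))  # close to 0: unrelated
```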
We can even visualize this a bit ourselves, since we have all those numbers lying right there in weights. We humans, of course, have trouble visualizing the high number of dimensions often involved in data and machine learning, so the standard trick is to use PCA to reduce the number of dimensions to something more reasonable.
Note: I remember always hearing the term “dimension reduction” in machine learning circles, and being intimidated by it. I’m not going to pull the “but it’s a really simple concept” again, so instead I’ll just point you to this now somewhat famous PCA tutorial video that steps through the process at a nice learner’s pace.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)
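If you’d rather see what PCA is actually doing here than treat it as a black box: it centers the data and projects it onto the directions of greatest variance (the top singular vectors). A minimal numpy sketch of the same 2-component reduction, with toy data standing in for our (2000, 20) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # toy stand-in for the (2000, 20) weight matrix

def pca_2d(X):
    """Project rows of X onto their top 2 principal components."""
    Xc = X - X.mean(axis=0)             # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                # coordinates along the top 2 directions

reduced = pca_2d(X)
print(reduced.shape)  # (100, 2)
```

(sklearn’s `PCA` does this for us, with extra conveniences like sign conventions and explained-variance ratios, so the code above is just for intuition.)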
Now we’ll make a data frame from our results:
embedding_df = pd.DataFrame({
'word' : vocab,
'x0' : weights[:,0],
'x1' : weights[:,1]
})
embedding_df
| | word | x0 | x1 |
|---|---|---|---|
| 0 | | -0.151776 | -0.027071 |
| 1 | [UNK] | 0.046081 | -0.015569 |
| 2 | to | -0.034870 | 0.088164 |
| 3 | trump | -0.028481 | -0.045172 |
| 4 | in | -0.059436 | -0.018053 |
| ... | ... | ... | ... |
| 1995 | 14 | 0.415547 | -0.119405 |
| 1996 | “it’s | 0.978748 | 0.099301 |
| 1997 | “hillary | 1.296868 | -0.076418 |
| 1998 | “he | 1.399378 | 0.102498 |
| 1999 | “black | 1.379671 | 0.037937 |
2000 rows × 3 columns
import plotly.express as px
fig = px.scatter(embedding_df,
x = "x0",
y = "x1",
size = list(np.ones(len(embedding_df))),
size_max = 2,
hover_name = "word")
fig.show()
Isn’t that neat? The part I’m most impressed by, of course, is that a machine could learn all this on its own (learning a lot about how humans speak even though we didn’t give it a dictionary or tell it anything about how language works), but it could also make neat associations between words that were close in meaning, or even just subjects of closely related news.
For example, something interesting we can take a look at is where certain categories of words are physically located in this cloud.
far_foreign = ["vietnam", "philippines", "chinas", "syria", "beijing", "brazil"]
domestic = ["us", "usa", "america", "american", "states", "hillary", "trump"]
def region_mapper(x):
    if x in far_foreign:
        return 1
    elif x in domestic:
        return 4
    else:
        return 0
embedding_df["highlight"] = embedding_df["word"].apply(region_mapper)
embedding_df["size"] = np.array(1.0 + 50*(embedding_df["highlight"] > 0))
import plotly.express as px
fig = px.scatter(embedding_df,
x = "x0",
y = "x1",
color = "highlight",
size = list(embedding_df["size"]),
size_max = 10,
hover_name = "word")
fig.show()
Interestingly, the one purple (foreign) word on the right-hand side of the cloud is “Syria”. Perhaps this speaks to the different nature of articles about Syria (war, instability, lots of foreign participants) compared with the more standard foreign-politics coverage of the other nations (trade and routine news).
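If you want to chase down hunches like that more systematically, you can list the nearest neighbors of any word in the 2-D cloud instead of eyeballing the plot. A sketch assuming a data frame shaped like our embedding_df (toy rows here so it runs on its own):

```python
import numpy as np
import pandas as pd

# toy stand-in for embedding_df from above
embedding_df = pd.DataFrame({
    "word": ["syria", "war", "trade", "beijing", "hillary"],
    "x0":   [ 1.0,    1.1,   -0.9,    -1.0,      0.9],
    "x1":   [ 0.2,    0.1,    0.0,     0.1,      0.3],
})

def nearest_words(word, df, k=3):
    """Return the k words closest to `word` in the 2-D PCA plane."""
    target = df.loc[df["word"] == word, ["x0", "x1"]].to_numpy()[0]
    dists = np.linalg.norm(df[["x0", "x1"]].to_numpy() - target, axis=1)
    order = dists.argsort()
    return [df["word"].iloc[i] for i in order if df["word"].iloc[i] != word][:k]

print(nearest_words("syria", embedding_df))
```

On the real data frame you could query any vocabulary word the same way, though keep in mind the distances are in the PCA-reduced plane, not the full 20-dimensional embedding space.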